A selected history of expectation bias in physics 
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The beliefs of physicists can bias their results towards their expectations in a number of ways. We 
survey a variety of historical cases of expectation bias in observations, experiments and calculations. 
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It is a capital mistake to theorise before one has 
data. Insensibly one begins to twist facts to suit 
theories, instead of theories to suit facts. 

Sherlock Holme^ 

But are we sure of our observational facts? Scientific 
men are rather fond of saying pontihcally that one 
ought to be quite sure of one's observational facts 
before embarking on theory. Fortunately those who 
give this advice do not practice what they preach. 
Observation and theory get on best when they 
are mixed together, both helping one another in 
the pursuit of truth. It is a good rule to not put 
overmuch confidence in a theory until it has been 
confirmed by observation. I hope I shall not shock 
the experimental physicists too much if I add that it 
is also a good rule not to put overmuch confidence 
in the observational results that are put forward 
until they have been confirmed by theory. 

Sir Arthur S. Eddingtor^ 



I. INTRODUCTION 

Under an idealized view of science, the theoretical 
beliefs of experimenters should have no effect on what 
results are obtained. Theoretical beliefs can influence 
what experiments are done, and how they are interpreted 
and recieved, but the actual results are facts determined 
purely by nature. But in practice, one can worry that 
even honest and careful experimenters might still some- 
how transfer their biases to their end results. 

Sherlock Holmes warns against this by forbidding any 
theorizing in advance of the experiment. But it is difficult 
to sec how this is possible, let along desirable. And when 
we remember that Sir Arthur Conan Doyle, the author 
of Sherlock Holmes, uncritically accepted, and indeed ac- 
tively promoted, the "data" that fairies had been seen 
and photographed in Cottingley, EnglandS-, we begin to 
think that perhaps not all "experimental facts" should 
be accepted and treated equally. Eddington's character- 
ization of the complex interplay between theory and ex- 
periment seems to come closer to the truth. Perhaps the 
facts from experiments and observations are not always 
simple "facts," to be accepted at face value. 

Expectation bias is well-known, and widely discussed 
in the "softer" sciences, that study humans, such as psy- 
chology and medicine. But what role does it play in 



"hard" sciences, such as physics? The impression one 
gets from the history in physics textbooks is that it plays 
virtually no role, but this may be misleading. Because 
textbooks have as their primary goal to teach physics, 
extended discussions of how this physics was arrived at 
can seem like pointless and potentially confusing diver- 
sions; the emphasis is on how careful and correct rea- 
soning leads to correct results, even if that reasoning is 
retrospective, and ahistorical. Convoluted reasoning that 
was actually followed is replaced with the clearer reason- 
ing that "should" have been followed. Many readers will 
doubtlessly be surprised to discover that Planck was not 
led to quantization in an attempt to fix an "ultravio- 
let catastrophe" discovered by Rayleigh*"'' , that R0mer 
never calculated a value for the speed of light^, and that 
Einstein was not primarily motivated by the Michclson- 
Morley experiment in his invention of special relativity^. 
These, and other examples, are discussed in the reviews 
by Brush and Whitakes^. While this systematic bias 
in textbooks is understandable, it unfortunately tends to 
eliminate cases of expectation bias. Here, we consider 
some historical cases where beliefs of physicists have in- 
fluenced their results. 



II. OBSERVATIONS 

When human perception is at its limits, even ba- 
sic observations are unreliable, and may be biased to- 
wards what the observer expects to see. Van Helden 
surveyed the history of early telescopic observations of 
Saturn^. While most observers today, even with very 
low-resolution pictures, see a planet with rings, early ob- 
servers echoed Galileo's first observations, and saw and 
reported on a planet with two giant moons, one on each 
side. Van Helden's analysis convincingly shows that the 
inaccurate drawings of Saturn came not from poor tele- 
scopes, but from the influence of Galileo's flrst reports on 
later observers. Van Helden traces the evolution of ob- 
servations from a "two moon" phase to a "ring" phase. 

Another example can be seen in the N-rays of the early 
1900's.9. From 1903 to 1906, roughly 300 papers on N- 
rays were written by 100 scientists and doctors through- 
out France, describing in detail the properties of N-rays, 
as deduced by the faint lines they produced on white sur- 
faces in dark rooms. However, many researchers were un- 
able to see these lines. While there were early suspicions, 
it was not until several years had passed that the scientific 
community came to decide that the positive results arose 
from experimenters convincing themselves that they saw 
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faint lines and flashes that were not really there. In fact, 
anyone staring at a white surface under such conditions 
long enough will start to see dark spots and lines, and the 
judgment as to what constituted a flash turned out to be 
rather subjective. While this case has been used as an 
example of "pathological science," it is not clear how 
to distinguish pathological science from normal science 
in error. 

As Brewer points out, in almost all cases like these, 
the observations required were at the limits of human 
perception, and it is in these sorts of cases where we can 
expect subconscious biases to seep inii. When the per- 
ceptual data are unambiguous, these sorts of problems 
will generally not occur; no experimenter is likely to re- 
port that apples fall up. However, it is not always clear 
what constitutes "unambiguous" perceptual data. Bisson 
and Dennen noted that while Newton's experiments with 
color used a prism under conditions where spectral lines 
should have been clearly visible, Newton never reported 
seeing such lines^^. Boring argued that Newton's pre- 
suppositions essentially prevented him from seeing these 
spectral linesi^. Similarly, it is striking that there are no 
European records of the 1054 supernova. This supernova 
was visible to the naked eye by daylight for several weeks, 
and at night for over a year, and appears in Chinese and 
Japanese records. It has often argued that Europeans 
failed to record and discuss the supernova because of 
ideological committments to an unchanging universei^. 
These arguments are necessarily speculative, since it is 
difflcult to draw firm conclusions from the absence of 
records, but the lack of reports of such a dramatic event 
is rather striking. 

But one does not need to rely on historical cases to see 
this observational bias in practice. Physics Education 
Research has shown that students in physics classes do 
not always see what they are "supposed to" in demon- 
strations. Instead, students sometimes see what they ex- 
pect to see, even when the demonstration is supposed 
to provide a dramatic illustration of a counterintuitive 
result. Redish describes how his students "saw" a mar- 
ble leaving a curved track continue to curve, instead of 
seeing the straight line motion that they were supposed 
to observe^'^^. Gunstone and White reported similar re- 
sults when studying student observations of experiments 
on gravity; what the students saw was highly correlated 
with what they expected to seei&. 

However, even when the perceptual data is completely 
clear, other errors can occur. For example, when data 
is recorded by hand, it may simply be misrecorded. It 
might be thought that these errors, while regrettable, 
would have no overall effect in any direction. However, 
in 1978 Rosenthal reviewed 21 psychology studies that al- 
lowed analysis of errors in recording. Subject responses 
recorded by human observers were compared to those 
recorded by mechanical devices. Roughly 140,000 obser- 
vations, by over 300 observers, were looked at. Rosen- 
thal found that about two-thirds of the recording er- 
rors favored the hypothesis of the observer, significantly 



greater than the 50% that would be expected if the ob- 
servers were unbiasedii. Psychology differs substantially 
from physics, and has many more ways in which exper- 
imenter beliefs can influence their resultsJ^. However, 
simple recording errors should presumably play the same 
role in physics experiments where individual data points 
can immediately be seen as favoring or disfavoring a hy- 
pothesis. 



III. CALCULATIONS 

However, in many physics experiments, data can only 
be seen as favoring or disfavoring a hypothesis after 
lengthy calculations. This would seem to prevent record- 
ing errors from being biased in any particular direction. 
However, this will not save us from experimenter biases, 
because errors then can occur in calculations, and just as 
with the recording errors, will tend to favor experimenter 
predispositions. 

The first experiment to observe and quantitatively 
measure the pressure due to light, by Nichols and Hull, 
found agreement with Maxwell's theory to within 1%. 
However, Bell and Green later reanalyzed the data from 
these experiments, and found several mathematical er- 
rors: Nichols and Hull had used an incorrect value for the 
mechanical equivalent of heat, had taken some logarithms 
to base ten instead of base e, and made several mistakes 
over units and conversion factors. Once these mistakes 
were corrected, the results deviated from Maxwell's the- 
ory by 10%, which was still a success, but not nearly as 
good as the original 1% agreement reportediS. It seems 
reasonable to think that Nichols and Hull, seeing such 
good agreement with Maxwell's theory, had been happy 
to publish, but that if the mathematical errors had been 
in the opposite direction, they would have checked over 
their calculations more carefully. 

More subtle errors occurred in the searches for free 
quarks by Fairbank and his coUaborators^S. They mea- 
sured the charges on several niobium spheres, obtaining 
(-0.343±0.011)e, (0.001±0.033)e, and (-h0.328±0.007)e, 
in excellent agreement with the fractional charges of ±e/3 
expected for quarks. However, other experimenters failed 
to find evidence of free quarks. The calculations needed 
to turn Fairbank's raw data into charges were quite com- 
plex, and it was suggested that the Fairbank group might 
have been unconsciously biased by their expectations 
when doing them. The Fairbank group thus did a "blind" 
analysis. To do this, they added a random offset to their 
original data, and recalculated all charges without know- 
ing the value of this offset. Only when all calculations 
were completed was the offset revealed and removed. 
When this was done, the calculated charges were instead 
(-H0.189 ± 0.02)e, and (-hO.253 ± 0.02)e, which did not 
agree with the quark model, or indeed, any major theo- 
retical model. This case, and other cases of blind analysis 
in physics, can be found in the book by Franklin^*^. 

The effects of expectation bias can be particularly 
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strong in cases where the physicists have a great deal 
of freedom in their analysis, such as when "eyeballing" 
fits. In the 1920's, Millikan and Cameron ran experi- 
ments on cosmic rays, to test Millikan's theory of the 
"birth cry of atoms," which postulated that heavy ele- 
ments were formed in the space between the stars2i. Not 
only was this theory false, but they analyzed their cos- 
mic ray data with absorption formulae from Dirac that 
were subsequently discovered to be in error. Neverthe- 
less, Millikan and Cameron were able to obtain excellent 
agreement between their experimental results, and the 
"birth cry of atoms" theory! Their results gave quanti- 
tatively precise agreement for the production of oxygen, 
nitrogen, helium, and silicon, and they obtained rough 
agreement for the production of iron^^. They obtained 
these results by measuring ionization from cosmic rays as 
a function of depth, and then fitting by eye this absorp- 
tion curve as a sum of three exponentials. But such a fit 
gave them excessive freedom, so that their final results 
for the coefficients and exponents of these three exponen- 
tials, were essentially arbitrary. 

Calculations influenced by the desire to obtain certain 
results even date back to Newton's PrincipioS^. New- 
ton's calculation of the speed of sound in air agreed with 
experiment to within 0.1%. This calculation of the speed 
of sound was based on the assumption that the the air un- 
derwent isothermal compressions and expansions. How- 
ever, correct treatment of sound in air requires treating 
the compressions and expansions as adiabatic, so New- 
ton should have obtained a speed of sound 15% too low. 
It is often said that Newton did obtain a speed of sound 
15% too low; see, for example Ref. But he did not: 
he obtained results that agreed perfectly with the exper- 
iments at the time! He did this by applying a number of 
poorly-motivated "corrections" to his calculations, based 
on quantities he could not possibly have known (and did 
not know), such as the finite volume of space taken up 
by air molecules, and the effects of the water vapor in 
the air. He thus calculated corrections to the isothermal 
result that magically achieved a 0.1% agreement with the 
most recent experimental results available to him. 

An interesting postscript appears in the later correc- 
tion to Newton's calculation of the speed of soundSSiSSiSl. 
Other scientists recognized Newton's fudging for what it 
was, and the discrepancy between the theoretical and 
experimental results for the speed of sound remained an 
outstanding problem for over 100 years. In the early 
1800's, Laplace, armed with recent experimental results 
on specific heat capacities of gasses by Delaroche and Be- 
rard, gave an adiabatic treatment of sound waves, deriv- 
ing a speed of sound differing only 2.5% from experiment. 
But not only did Laplace use the now-discredited caloric 
theory in his determination of 7, but we now know that 
the experimental results of Delaroche and Berard are 12% 
off the corrent values^! The two errors ended up can- 
celing, giving a spurious agreement. Here the agreement 
seems to be fortuitous, rather than the result of bias. 



IV. THE BANDWAGON EFFECT 

In other cases, experimenter bias is clear, but it is 
more difficult to identify precisely where the bias crept 
in. When, in 1915, Einstein and de Haas performed the 
first experiments measuring the gyromagnetic ratio of the 
electron, they expected, on the basis of their models, to 
find a (/-factor of 5 = 1. We know today that the correct 
value is roughly 2, yet Einstein and de Haas obtained 
1.02 ± O.IC^^! Around the same time, the American sci- 
entist Barnctt independently did two sets of experiments 
on magnetism, and found g = 2.0, and g = 2.3. But after 
hearing of the results of Einstein and de Haas, Barnett 
repeated his experiments, and then reported that g was 
between 1.1 and 1.4, stating that 1.0 was within his error 
bars. Subsequent experiments by other groups did ob- 
tain values closer to 2. However, de Haas, in three more 
experiments in 1915 and 1923, continued to obtain values 
significantly below 2 (1.2, 1.11, and 1.55). It seems clear 
that Einstein, de Haas, and Barnett, were somehow influ- 
enced by their expectations. A very enjoyable history of 
this episode, as well as the already-mentioned Millikan 
"birth cry of the atoms" episode, can be found in the 
book by Galison^^. Galison speculates that Einstein and 
de Haas may have unintentionally corrected for system- 
atic errors in a biased fashion, fixing systematic errors 
that led to higher values of g, but leaving alone ones that 
led to lower values of g. 

A similar problem can be seen in the history graphs 
produced by the Particle Data Group's Review of Particle 
Properties. This group produces a report of all physical 
properties every two years, and their history graphs show 
reports of certain particle properties as a function of the 
year of their report. Ideally, the reported values should 
be randomly scattered about the constant correct value. 
However, some of the graphs show distinct trends with 
time^^. The 1980 Particle Data Group explained this as 
follows^: 

We show these figures not only to demonstrate that 
there is not much change in these averages in the 
usual case, but also to show that there exist cases 
with relatively large changes. There is a psycholog- 
ical danger in preparing tables of "right" answers. 
The old joke about the experimenter who fights the 
systematics until he or she get the "right" answer 
(read "agrees with previous experiments"), and then 
publishes, contains a germ of truth (presumably, 
those who compile and average experimental results 
are also not immune to this disease). A result can 
disagree with the average of all previous experiments 
by five standard deviations and still be right! Hence, 
perhaps it is of value to show that large changes can 
(and do) sometimes occur. 

Franklin terms these time-dependent shifts and trends 
"bandwagon effects," and discusses a particularly dra- 
matic case with | ry_| | , the parameter that measures CP 

violation'^ ^. As seen in Fig. ^ measurements of | ry^ | 
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FIG. 1: Measurements of | r;^ | in order of their year of pub- 
lication. Reprinted with permission from A. Franklin, "Forg- 
ing, cooking, trimming, and riding on the bandwagon," Am. 
J. Phys. 52, 786-793 (1984), copyright 1984, American Asso- 
ciation of Physics Teachers. 



before 1973 are systematically different from those after 
1973. 

Henrion and Fischhoff graphed experimental reports 
on the speed of light, as a function of the year of the ex- 
periment. Their graph (Fig.l^J shows that experimental 
results tend to cluster around a certain value for many 
years, and then suddenly jump to cluster around a new 
value, often many error bars from the previously accepted 
value- . Two such jumps occur for the speed of light. 
Again, the bias is clearly present in the graph, but the 
exact mechanism by which it occured is harder to discern. 



V. RESULTS THAT ARE TOO GOOD TO BE 
TRUE 

Newton's results on the speed of sound, and Millikan 
and Cameron's results on cosmic rays, are in retrospect 
too good to be true, because they were working with 
deficient or incorrect theories. But even results that agree 
with correct theories can be too good to be true. 

Henrion and Fischhoff looked at the reported values 
for a number of other fundamental constants, and with 
one exception, found that error bars had been systemat- 
ically underestimated — results were more spread about 
the correct value than they should have been, based on 
the reported error barg"^^. Underestimating error bars 
is not necessarily a sign of experimenter bias. However, 
their one exception is. Unlike the fundamental constants 
such as h or c, for which there are no theoretical cal- 
culations to tell experimenters what to expect, experi- 
ments measuring the ratio of the absolute ohm to the 
ohm as maintained by the National Bureau of standards 
i^ABs/^NBs) "should" obtain one, if all calibrations 
are correct. Henrion and Fischhoff found that the experi- 
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FIG. 2: Measurements of the speed of light, as a function of 
the year of the experiment. Note that the vertical scale is 
changed at 1900, to make the errors bars visible. Reprinted 
with permission from M. Henrion and B. Fischhoff, "Assessing 
uncertainty in physical constants," Am. J. Phys. 54, 791- 
798 (1986), copyright 1986, American Association of Physics 
Teachers. 



mental results for flABs/^NBS were too closely clustered 
around one — given the error bars reported, the chance 
was only 1 in 200 that the results would be so clustered 
together. This indicates that experimenters, knowing 
that they should obtain one, somehow biased their mea- 
surements or calculations to get "the right answer." 

We depart briefly from physics to consider a famous 
case from biology. Fisher's reexamination of Mendel's 
results in his famous experiments on heredity in peas, 
showed that Mendel's results were simply too good to 
be true. Many of his results show ratios that almost ex- 
actly match the theoretical values, with deviations much 
smaller than the expected statistical fluctuations. Fisher 
showed that only 3 experiments in 100,000 would obtain 
results as close to theory as Menders'*-?^. This has often 
been taken as proof of fraud, either on the part of Mendel, 
or an assistant, but no evidence of fraud has been found, 
other than the fact that the data is too good to be true. 
As we have seen, such biases can arise in a number of ways 
without intentional fraud. For example, Root-Bernstein 
has shown that the results can be explained if Mendel 
was unconsciously biased in cases where the assignation 
of pea traits was ambiguous^^. Olby has pointed out that 
the results could be explained if Mendel kept track of the 
ratios as he collected data, and stopped recording when 
the ratios were what he expecte d '^^'^^ 
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VI. OTHER CASES 

Experimenter biases can also arise from ambiguities in 
when to discard data. If equipment was clearly malfunc- 
tioning during a run, data can no longer be used. But 
it will sometimes be unclear whether data should be dis- 
carded, and in ambiguous cases, whether the data agrees 
with expectations may unconsciously (or consciously) tip 
the experimenter in one direction or another. One of 
the most (in)famous cases is that of Millikan's oil drop 
experiments, where he both demonstrated the quantiza- 
tion of charge, and measured this quantized charge^. 
In his paper he claimed "/i is to be remarked, too, that 
this is not a selected group of drops but represents all of 
the drops experimented on during 60 consecutive days. . 

But later inspection of his lab notebook shows that 
he threw out results for a number of oil drops, in many 
cases after calculating their charge. For some oil drops 
the equipment was clearly malfunctioning. For others, 
the data failed certain consistency checks. And for a few 
drops, Millikan simply discarded them because the calu- 
lated charges were too far off from what he believed to be 
the correct value. The exclusion of the latter data points 
appears to cross the line into fraud, but the reasons used 
to discard data form a continuum from obviously reason- 
able to fraudulent, and illustrate the difficulty in deter- 
mining when it is appropriate to discard data. (Fraud 
is, of course, also a way in which experimenter beliefs 
can influence their results, but discussion of such cases is 
beyond the scope of this paper.) 

Here we have only discussed the ways in which theoret- 
ical beliefs can affect the actual results reported. Once 
the results are reported, theoretical beliefs can also af- 
fect how the results are interpreted, and whether they 
are believed and/or published. We will not discuss such 
cases here, except to note that it is not always easy to 
separate the cases where theoretical beliefs have affected 
the experiment, and when they have merely affected its 
interpretation. The difficulty of this distinction is well- 
illustrated by an episode in the development of the kinetic 
theory of gasses. 

In 1859 the scientific understanding of heat was still 
uncertain, and the atomic picture for materials was still 



controversial. Maxwell considered his now-famous model 
of a gas as a collection of billiard balls (atoms) undergoing 
elastic collisions, and came up with a startling result — 
he showed that with this model, the viscosity of a gas 
would be independent of its density"^^. This was sur- 
prising, since he and his colleagues, naturally, expected 
the viscosity to go to zero as the density went to zero. 
Maxwell asked Stokes about this, and Stokes assured him 
that experiments by Sabine on the damping of pendulums 
in air had demonstrated that the viscosity of air went to 
zero at low densities — apparently, a strong disconfirma- 
tion of the billiard ball model! Nevertheless, Maxwell 
continued his investigations, and eventually did exper- 
iments showing that gas viscosity was constant over a 
wide range of densities. With this triumph of the billiard 
ball model. Stokes went back and reanalyzed Sabine's 
experiment. Stokes came to realize that, in fact, his pre- 
vious analysis of Sabine's experiment has implicitly as- 
sumed that the viscosity was proportional to density at 
low densities, thus assuming the result he claimed to be 
proving. It could be argued that this is not a case where 
theory affected reported experimental results, but merely 
one where it affected the way in which results were in- 
terpreted; but the line is blurry, for Stokes and Maxwell 
both apparently believed that "the viscosity goes to zero 
as the density goes to zero," was the experimental result. 

We have seen that there are a number of ways in which 
the theoretical expectations of physicists can influence 
their results. This bias is certainly less prominent than in 
fields such as psychology and medicine, but it is clearly 
present, and important if we want to understand how 
physics works in practice. Finally, if the reader knows of 
any cases of expectation bias, either historical or recent, 
in their own fields, I'd love to hear about them! 
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